Word embedding is a collective term for unsupervised ML models that learn to map the words $w$ of a vocabulary (or phrases, stems, lemmas) to vectors of numerical values. This approach reduces the number of dimensions from the number of unique words $V$ (i.e. the vocabulary size) to a much lower value $N$, where the dimensions are shared by all words, so the resulting vectors $w^{\prime}$ are no longer orthogonal. In addition, because of the way the embeddings are computed, the ML models discover patterns in the relations between words (such as a shared context).
Given a word vocabulary $W=\{w_i\}$ with $|W|=V$, each word is represented by a unit vector (i.e. one-hot encoding in ML) such as
$w_1=\begin{bmatrix}1 \\ 0 \\ \vdots \\ 0\end{bmatrix}$, $w_2=\begin{bmatrix}0 \\ 1 \\ \vdots \\ 0\end{bmatrix}$, $w_V=\begin{bmatrix}0 \\ 0 \\ \vdots \\ 1\end{bmatrix}$, $W\in \mathbb{R}^{V \times V}$
Then the word embedding maps this representation $w_i$, $i=1,2,\dots,V$, to another representation (through Skip-gram or CBOW modeling) whose vectors $w^{\prime}_i$, $i=1,2,\dots,V$, live in a smaller dimension $N$ with $N \ll V$, so that $W^{\prime}\in \mathbb{R}^{N \times V}$.
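As a minimal numpy sketch of this dimensionality reduction (using a random matrix in place of a learned $W^{\prime}$, since the real one comes from training):

```python
import numpy as np

V, N = 10, 3                      # vocabulary size and embedding dimension, N << V
rng = np.random.default_rng(0)

# One-hot representation: word i is the i-th standard basis vector
W = np.eye(V)                     # shape (V, V)

# A stand-in embedding matrix (random here; normally learned by Skip-gram/CBOW)
W_prime = rng.normal(size=(N, V)) # shape (N, V)

# Embedding a word reduces it from V to N dimensions;
# for a one-hot vector this is just a column lookup
w3 = W[:, 3]                      # one-hot vector of word 3
e3 = W_prime @ w3                 # dense vector, shape (N,)
assert np.allclose(e3, W_prime[:, 3])
```

Note that multiplying $W^{\prime}$ by a one-hot vector simply selects a column, which is why trained embedding matrices are used as lookup tables.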
Both Skip-Gram and Continuous Bag of Words (CBOW) models use a neural network architecture to model the word mapping from the original $W$ to the embedding $W^{\prime}$
The embedding places each word in a vector space $W^{\prime}$ such that word vectors that are close to one another belong to words that are related to one another.
Recall that the above representation of words is also used in the Tf-Idf feature matrix, where each column is a word and each row is a document or a sentence.
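For comparison, a small scikit-learn sketch of such a document-term matrix (a toy two-document corpus, invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus made up for illustration
docs = ["the man passes the sentence",
        "the man swings the sword"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)     # rows = documents, columns = words
print(X.shape)                  # (2 documents, 6 unique words)
print(sorted(vec.vocabulary_))
```

Each row of `X` is a sparse $V$-dimensional document vector, in contrast with the dense $N$-dimensional vectors the embedding models below produce.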
An n-gram is a contiguous sequence of n items from a given sample of text. The items can be letters, syllables, or words depending on the application. N-grams are typically collected from a text or speech corpus. When the items are words, n-grams are also called shingles.
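A minimal sketch of n-gram (shingle) extraction over word tokens:

```python
def ngrams(tokens, n):
    """Contiguous n-grams of a token list (shingles when tokens are words)."""
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

words = "the man who passes the sentence".split()
bigrams = ngrams(words, 2)
# [('the','man'), ('man','who'), ('who','passes'), ('passes','the'), ('the','sentence')]
print(bigrams)
```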
Define the context as a symmetric window centered on the target word $w_t$, containing the surrounding tokens at a distance of at most the window size $\texttt{ws}$: $C_t = \{w_k \mid k \in [t-\texttt{ws},\, t+\texttt{ws}],\ k \neq t\}$
Skip-gram model predicts the context words within a specific window given the current word. The input layer of the neural network model uses the current word and the output layer uses the context words. The hidden layer contains nodes matching the number of dimensions $N$.
Skip-gram learns by predicting the context for a given target word maximizing $\prod\limits_{t=1}^{T}P(C_t|w_t)$.
"The man who passes the sentence should swing the sword." Ned Stark
Sliding window size $\texttt{ws}= 5$
| Window | Target | Context |
|---|---|---|
| [The,man,who] | the | man,who |
| [The,man,who,passes] | man | the,who,passes |
| [The,man,who,passes,the] | who | the,man,passes,the |
| [man,who,passes,the,sentence] | passes | man,who,the,sentence |
| [sentence,should,swing,the,sword] | swing | sentence,should,the,sword |
| [should,swing,the,sword] | the | should,swing,sword |
| [swing,the,sword] | sword | swing,the |
Continuous Bag of Words model predicts the current word given the context words within a specific window. The input layer uses the context words and the output layer uses the current word. The hidden layer is of length $N$. CBOW is the opposite of Skip-gram.
The CBOW model tries to predict the target word given its context, maximizing the likelihood $\prod\limits_{t=1}^{T}P(w_t|C_t)$
To model the Skip-gram or CBOW probabilities, a Softmax activation is used on top of the inner product between a target vector $\texttt{u}_{w_t}$ and the averaged context vector $\frac{1}{|C_t|}\sum_{w \in C_t}\texttt{v}_w$.
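A toy numpy illustration of this Softmax construction for $P(w_t \mid C_t)$, with random (untrained) vectors standing in for the learned ones:

```python
import numpy as np

rng = np.random.default_rng(1)
V, N = 6, 4                        # toy vocabulary and embedding sizes
U = rng.normal(size=(V, N))        # target ("output") vectors u_w, one row per word
Vv = rng.normal(size=(V, N))       # context ("input") vectors v_w

context_ids = [1, 2, 4]            # indices of the words in C_t
h = Vv[context_ids].mean(axis=0)   # averaged context vector (1/|C_t|) * sum v_w

scores = U @ h                     # inner products with every target vector
p = np.exp(scores) / np.exp(scores).sum()   # Softmax over the vocabulary
print(p)                           # P(w_t | C_t) for every candidate target word
```

Training adjusts `U` and `Vv` so that the probability mass lands on the words actually observed in context.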
Disadvantage: A limitation of word embeddings is that the possible multiple meanings of a word are conflated into a single representation (unlike WordNet, where this knowledge is carried in the graph).
Solution: Develop a data-driven WordNet, for example one based on word embeddings.
"You shall know a word by the company it keeps." J.R. Firth
Word2vec techniques use the context of a given word to learn its semantics: Word2vec learns numerical representations of words by looking at the words surrounding a given word.
Imagine that in an exam you encounter the following sentence: "Mary is a very stubborn child. Her pervicacious nature always gets her in trouble." What does pervicacious mean? The words surrounding the word of interest are important. In our example, pervicacious is surrounded by stubborn, nature, and trouble. These three words are enough to determine that pervicacious in fact means a state of being stubborn.
Gensim (= "Generate Similar") is a topic modeling library that implements Latent Semantic Methods, and it is licensed under the GNU LGPLv2.1 license.
The Word2Vec module of gensim can generate CBOW and skip-gram models. Here is the API.
Note that Word2Vec does not remove stop words, because the algorithm relies on the broader context of the sentence in order to produce high-quality word vectors.
Word2vec processes text by vectorizing words, generating feature vectors that represent the words in the corpus. Similarly, Doc2Vec processes entire variable-length documents, vectorizing them into fixed-length feature vectors. Doc2Vec uses a scheme similar to Word2Vec, extending skip-gram or CBOW with an additional document/paragraph vector $D$. During the training of the words, as in Word2Vec, the document vector $D$ is also trained, and thus it comes to represent the document. Here is the API.
Install gensim: `conda install gensim python-levenshtein`
Let's load the nltk dataset named abc and use gensim to generate word embeddings with the CBOW approach.
%%time
import nltk
import gensim
print(f'gensim version= {gensim.__version__}')
from gensim.models import Word2Vec
from nltk.corpus import abc
sents = list(abc.sents())
model = Word2Vec(sents, min_count=2, workers=4)  # reuse the sentence list loaded above
X = list(model.wv.index_to_key)
# Sanity
print(f'ABC dataset has {len(sents)} sentences')
print(f'gensim model vocabulary has {len(X)} words mapped to N= {model.vector_size} dimensions')
gensim version= 4.3.0
ABC dataset has 29059 sentences
gensim model vocabulary has 19484 words mapped to N= 100 dimensions
CPU times: total: 18.9 s
Wall time: 15.3 s
# The closest words to the word 'science'
science = model.wv.most_similar('science')
print(science)
[('agriculture', 0.9628815650939941), ('Coalition', 0.9452467560768127), ('law', 0.9432271718978882), ('management', 0.9409207701683044), ('textile', 0.9401235580444336), ('biosecurity', 0.9397401213645935), ('general', 0.9383472204208374), ('descend', 0.9369747042655945), ('bulk', 0.9362186789512634), ('education', 0.9359689950942993)]
# Distance between computer and science
science12 = model.wv.similarity('science', 'computer')
print(science12)
0.7867699
Let's see another example from Shakespeare's play Hamlet using CBOW and skip-gram methods, respectively.
from nltk.corpus import gutenberg
sents = list(gutenberg.sents('shakespeare-hamlet.txt'))
print(sents[0])
['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']']
%%time
# CBOW model
model1 = Word2Vec(sents, vector_size=200, sg=0, window=13, min_count=1, epochs=20, workers=4)
# Skip-gram model
model2 = Word2Vec(sents, vector_size=200, sg=1, window=13, min_count=1, epochs=20, workers=4)
CPU times: total: 8.2 s Wall time: 2.76 s
Let's find out what the model tells us when the context is the names: ['Hamlet', 'Ophelia', 'Ghost']
similarities1b = model1.wv.most_similar(positive=['Hamlet'], topn=20)
similarities1 = model1.wv.most_similar(positive=['Hamlet', 'Ophelia', 'Ghost'], topn=20)
similarities2 = model2.wv.most_similar(positive=['Hamlet', 'Ophelia', 'Ghost'], topn=20)
# Clean the stop words
def filter_words(_sim):
    from nltk.corpus import stopwords
    import re
    stop_words = set(stopwords.words('english'))
    return [(w, p) for w, p in _sim if w.lower() not in stop_words and re.search(r'^[a-zA-Z]{3,}$', w) is not None]
similarities1b = filter_words(similarities1b)
similarities1 = filter_words(similarities1)
similarities2 = filter_words(similarities2)
for (w1,s1),(w1b,s1b),(w2,s2) in zip(similarities1, similarities1b, similarities2):
    print(f'{w1:16s}{s1:.3f}\t\t{w1b:16s}{s1b:.3f}\t\t{w2:16s}{s2:.3f}')
Horatio         0.995		Horatio         0.989		Rosincrane      0.912
Manet           0.993		Manet           0.985		Claudius        0.847
Thankes         0.991		Welcome         0.984		Sister          0.844
Welcome         0.990		twaine          0.983		Osricke         0.844
twaine          0.990		Ophelia         0.983		Attendant       0.840
Reynoldo        0.989		feast           0.980		Queene          0.829
Words           0.986		murdering       0.979		Manet           0.828
goodnight       0.986		Cell            0.979		Voltumand       0.820
Processe        0.986		Thankes         0.979		Polonius        0.818
Things          0.986		goodnight       0.979		Farewell        0.806
drinkes         0.986		Reynoldo        0.979		Guildenstern    0.803
Barnardo        0.986		yong            0.978		Lucianus        0.801
sickly          0.986		standing        0.978		Laertes         0.798
foure           0.986		Ghost           0.978		Marcellus       0.797
remain          0.986		Words           0.978		Coffin          0.795
Thunder         0.985		Things          0.977		Welcome         0.794
earthly         0.985		sickly          0.977		Drumme          0.794
ayde            0.985		Peece           0.976		Foyles          0.788
Actus           0.985		earthly         0.976		Horatio         0.787
# The word embedding matrix
words1a = [w for w,s in similarities1] + ['Hamlet']
X1a = model1.wv[words1a]
words2a = [w for w,s in similarities2] + ['Hamlet']
X2a = model2.wv[words2a]
# Sanity
print(X1a.shape)
(21, 200)
We can use classical projection methods to reduce the high-dimensional word vectors to two-dimensional plots using PCA. The visualizations can provide a qualitative diagnostic for our learned model.
Let's train a projection method on the vectors.
from sklearn.decomposition import PCA
pca_model1a = PCA(n_components=2).fit_transform(X1a)
pca_model2a = PCA(n_components=2).fit_transform(X2a)
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.dpi"] = 72
def plot_pca(_pca_model, _words, _title):
    plt.scatter(_pca_model[:, 0], _pca_model[:, 1])
    for i, word in enumerate(_words):
        plt.annotate(word, xy=(_pca_model[i, 0], _pca_model[i, 1]), c=('r' if word=='Hamlet' else 'k'))
    plt.title(_title)
plt.figure(figsize=(12, 6), dpi=72)
ax=plt.subplot(1, 2, 1)
plot_pca(pca_model1a, words1a, 'CBOW Model')
ax=plt.subplot(1, 2, 2)
plot_pca(pca_model2a, words2a, 'Skip-gram Model')
plt.show()
dissimilarities1 = model1.wv.most_similar(negative=['Hamlet'], topn=20)
dissimilarities2 = model2.wv.most_similar(negative=['Hamlet'], topn=20)
words1b = ['Hamlet'] + [w for w,s in similarities1] + [w for w,s in dissimilarities1]
X1b = model1.wv[words1b]
words2b = ['Hamlet'] + [w for w,s in similarities2] + [w for w,s in dissimilarities2]
X2b = model2.wv[words2b]
pca_model1b = PCA(n_components=2).fit_transform(X1b)
pca_model2b = PCA(n_components=2).fit_transform(X2b)
plt.figure(figsize=(20, 10), dpi=72)
ax=plt.subplot(1, 2, 1)
plot_pca(pca_model1b, words1b, 'CBOW Model')
ax=plt.subplot(1, 2, 2)
plot_pca(pca_model2b, words2b, 'Skip-gram Model')
plt.show()
print(model1.wv.most_similar(positive=['Alas', 'poor', 'Yorick', 'Horatio'], topn=20))
[('sweet', 0.9958095550537109), ('face', 0.9955286979675293), ('yong', 0.9954279661178589), ('get', 0.9951246380805969), ('Sister', 0.9950670599937439), ('toward', 0.9948976039886475), ('false', 0.9948681592941284), ('heartily', 0.9947859644889832), ('!', 0.9946458339691162), ('ioyes', 0.9945765137672424), ('hit', 0.9945618510246277), ('O', 0.9945528507232666), ('Now', 0.9944879412651062), ('here', 0.994389533996582), ('Goodnight', 0.9943574070930481), ('twelue', 0.9942129254341125), ('Thou', 0.9942125082015991), ('Gertrude', 0.9942075610160828), ('Can', 0.994149923324585), ('Ere', 0.9941392540931702)]
print(model1.wv.most_similar(negative=['Alas', 'poor', 'Yorick', 'Horatio'], topn=20))
[('Brooch', 0.9245432019233704), ('Iemme', 0.9202792644500732), ('range', 0.9040265083312988), ('Minister', 0.8987101316452026), ('Scourge', 0.8897431492805481), ('Sphere', 0.8873246312141418), ('punish', 0.8795083165168762), ('Happy', 0.8684981465339661), ('moues', 0.8507909774780273), ('therfore', 0.8478710055351257), ('warmes', 0.8235416412353516), ('Barbars', 0.8225621581077576), ('kin', 0.818241536617279), ('Asking', 0.7901270389556885), ('key', 0.7899453043937683), ('Conference', 0.7849299311637878), ('coniunctiue', 0.7842671871185303), ('Lookes', 0.7804279327392578), ('caution', 0.7791343927383423), ('Button', 0.7767291069030762)]
Do you notice any meaningful words from the context ['Alas', 'poor', 'Yorick', 'Horatio'] in the output above?
What is that 'O'?
In the previous lectures we used Tf-Idf features and classified six news categories in the Reuters corpus. Now let's see how we can use Word2Vec-generated features.
With word embeddings of size $N$, given the embedding vectors $v_i$, $i=1,\dots,k$, of a document $d$ with $k$ words, the document feature vector is their mean: $\text{fv}(d)=\frac{1}{k}\sum_{i=1}^{k}v_i$
# borrowed from previous lectures
from nltk.corpus import reuters
from collections import Counter
import numpy as np
import pandas as pd
Documents = [reuters.raw(fid) for fid in reuters.fileids()]
# Categories are list of lists since each news may have more than 1 category
Categories = [reuters.categories(fid) for fid in reuters.fileids()]
CategoriesList = [_ for sublist in Categories for _ in sublist]
CategoriesSet = np.unique(CategoriesList)
print(f'N documents= {len(Documents):d}, K unique categories= {len(CategoriesSet):d}')
counts = Counter(CategoriesList)
counts = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
# Build the news category list
yCategories = [_[0] for _ in counts[:5]]
yCategories += ['other']
# Sanity check, M=29K
print(f'K categories for classification= {len(yCategories):d} {yCategories}')
N documents= 10788, K unique categories= 90 K categories for classification= 6 ['earn', 'acq', 'money-fx', 'grain', 'crude', 'other']
# nltk.download('reuters')
# Assign a category for each news text
yCat = []
for cat in Categories:
    bFound = False
    for _ in yCategories:
        if _ in cat:
            yCat += [_]
            bFound = True
            break  # So we add only one category for a news item
    if not bFound:
        yCat += ['other']
# Sanity check
print(f'N categories= {len(yCat):d}')
N categories= 10788
# Convert to numerical np.array which sklearn likes
ydocs = np.array([yCategories.index(_) for _ in yCat])
from nltk import word_tokenize
Sentences = [word_tokenize(doc) for doc in Documents]
%%time
# CBOW model
model = Word2Vec(Sentences, vector_size=300, sg=0, window=9, min_count=1, epochs=20, workers=4)
CPU times: total: 1min 26s Wall time: 23.7 s
# Use the mean of word vector that makes up a sentence or a document
# Note that there are better ways to use the word vector as a feature vector - such as doc2vec in gensim
Xdocs = np.array([np.mean([model.wv[word] for word in doc], axis=0) for doc in Sentences])
print(Xdocs.shape)
(10788, 300)
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
def kfold_eval_docs(_clf, _Xdocs, _ydocs):
    # Need indexable data structure
    acc = []
    kf = StratifiedKFold(n_splits=10, shuffle=False, random_state=None)
    for train_index, test_index in kf.split(_Xdocs, _ydocs):
        _clf.fit(_Xdocs[train_index], _ydocs[train_index])
        y_pred = _clf.predict(_Xdocs[test_index])
        acc += [accuracy_score(_ydocs[test_index], y_pred)]
    return np.array(acc)
%%time
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
acc = kfold_eval_docs(nb, Xdocs, ydocs)
print(f'Naive Bayes CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
Naive Bayes CV accuracy= 0.685 ±0.032 CPU times: total: 453 ms Wall time: 461 ms
%%time
from sklearn.ensemble import RandomForestClassifier
n_cores = 8
rf = RandomForestClassifier(n_jobs=n_cores, n_estimators=300, max_depth=10, random_state=None, class_weight='balanced')
acc = kfold_eval_docs(rf, Xdocs, ydocs)
print(f'Random Forest CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
Random Forest CV accuracy= 0.883 ±0.014 CPU times: total: 20min 49s Wall time: 2min 41s
%%time
from sklearn.svm import SVC
svm = SVC(kernel='rbf', gamma='scale', class_weight='balanced')
acc = kfold_eval_docs(svm, Xdocs, ydocs)
print(f'Support Vector Machine CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
Support Vector Machine CV accuracy= 0.888 ±0.013 CPU times: total: 48.2 s Wall time: 50.3 s
%%time
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
# To avoid non-convergence one has to increase 'max_iter' parameter
lr = LogisticRegression(solver='sag', multi_class='auto', max_iter=500, class_weight='balanced')
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning)
    acc = kfold_eval_docs(lr, Xdocs, ydocs)
print(f'Logistic Regression CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
Logistic Regression CV accuracy= 0.900 ±0.008 CPU times: total: 4min 13s Wall time: 4min 10s
Notice: In this previously seen problem, the classification performance is almost as good as in the previous lectures, and it runs much faster since the feature vectors have only 300 dimensions (the original Tf-Idf dimension was M = 29016).
In the previous cell we used Word2Vec features and classified six news categories in the Reuters corpus. Now let's see how we can use Doc2Vec-generated features.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# Doc2Vec expects TaggedDocument input data structure, every document is a list of words and tagged with int ID
DocumentsTagged = [TaggedDocument(word_tokenize(reuters.raw(fid)), [i]) for i, fid in enumerate(reuters.fileids())]
model2 = Doc2Vec(DocumentsTagged, vector_size=100, window=9, min_count=1, epochs=20, workers=4)
# Build X from Doc2Vec document vectors
Xdocs2 = np.array([model2.dv[_.tags[0]] for _ in DocumentsTagged])
print(Xdocs2.shape)
(10788, 100)
rf = RandomForestClassifier(n_jobs=n_cores, n_estimators=300, max_depth=10, random_state=None, class_weight='balanced')
acc = kfold_eval_docs(rf, Xdocs2, ydocs)
print(f'Random Forest CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
acc = kfold_eval_docs(svm, Xdocs2, ydocs)
print(f'Support Vector Machine CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
Random Forest CV accuracy= 0.727 ±0.019 Support Vector Machine CV accuracy= 0.775 ±0.017
# Try another model
model3 = Doc2Vec(DocumentsTagged, vector_size=100, dm=0, window=9, min_count=1, epochs=20, workers=4)
Xdocs3 = np.array([model3.dv[_.tags[0]] for _ in DocumentsTagged])
print(Xdocs3.shape)
rf = RandomForestClassifier(n_jobs=n_cores, n_estimators=300, max_depth=10, random_state=None, class_weight='balanced')
acc = kfold_eval_docs(rf, Xdocs3, ydocs)
print(f'Random Forest CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
acc = kfold_eval_docs(svm, Xdocs3, ydocs)
print(f'Support Vector Machine CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
(10788, 100) Random Forest CV accuracy= 0.882 ±0.016 Support Vector Machine CV accuracy= 0.913 ±0.014
Notice: In this approach the classification performance is almost as good as the previous results (or better), and it runs much faster since the feature vectors have only 100 dimensions (the original M was 29016).
Exercise 1. Many different approaches can be tried using the Word2Vec-generated word vectors and document vectors, such as generating the minimum and maximum vector values (magnitude-wise), doubling the dimension of the feature vectors (from $M$ to $2M$), etc.
Exercise 2. Try Top2vec and compare it to Doc2vec.

%%html
<style>
table {margin-left: 0 !important;}
</style>
<!-- Display markdown tables left oriented in this notebook. -->